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The prevalence of neutral mutations implies that biological systems typically have many more 
genotypes than phenotypes. But can the way that genotypes are distributed over phenotypes de¬ 
termine evolutionary outcomes? Answering such questions is difficult because the number of geno¬ 
types can be hyper-astronomically large. By solving the genotype-phenotype (GP) map for RNA 
secondary structure for systems up to length L — 126 nucleotides (where the set of all possible RNA 
strands would weigh more than the mass of the visible universe) we show that the GP map strongly 
constrains the evolution of non-coding RNA (ncRNA). Simple random sampling over genotypes 
predicts the distribution of properties such as the mutational robustness or the number of stems per 
secondary structure found in naturally occurring ncRNA with surprising accuracy. Since we ignore 
natural selection, this strikingly close correspondence with the mapping suggests that structures 
allowing for functionality are easily discovered, despite the enormous size of the genetic spaces. The 
mapping is extremely biased: the majority of genotypes map to an exponentially small portion of 
the morphospace of all biophysically possible structures. Such strong constraints provide a non- 
adaptive explanation for the convergent evolution of structures such as the hammerhead ribozyme. 
These results presents a particularly clear example of bias in the arrival of variation strongly shaping 
evolutionary outcomes and may be relevant to Mayr’s distinction between proximate and ultimate 
causes in evolutionary biology. 
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1. Introduction 

Many questions about the limits of evolution hinge not 
only what has happened in natural history, but also on 
counterfactuals: what did not happen, but perhaps could 
have. Re-run the tape of life mm and what parts of 
phenotype space - the set of all possible phenotypes [3] - 
would be occupied? Typically, only a minescule fraction 
of the phenotype space has been explored throughout 
natural history. The reasons given for this phenomenon 
usually combine adaptive arguments: some parts of phe¬ 
notype space yield higher fitness than others, with argu¬ 
ments based on contingency: nature hasn’t had time to 
explore all of phenotype space. However, evolutionary 
search does not occur by uniform sampling over pheno¬ 
types, but rather by random mutations in the space of 
genotypes. Does this basic fact alter the way that phe¬ 
notype space is explored and occupied? To answer such 
whole genotype-phenotype (GP) map questions is diffi¬ 
cult, in part because the number of possible genotypes 
typically grows exponentially with length, and so rapidly 
becomes unimaginably vast HI], and in part because bi¬ 
ological systems with sufficiently tractable GP maps are 
rare. 

One system where progress can be made is the mapping 
from sequences to the structure of RNA. Although sim¬ 
pler than many biological phenotypes, RNA is interesting 
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and important because it can fulfil multiple roles both 
as an information carrier (e.g. messenger RNA and viral 
RNA), or as functional non-coding RNA (ncRNA) [7], in¬ 
cluding chemically active catalysts (e.g. ribozymes) and 
key structural components of larger self-assembled struc¬ 
tures (e.g. ribosomal RNA). One reason RNA is so ver¬ 
satile is that it can fold into complex three-dimensional 
(3D) structures. The bonding pattern of these structures 
is called the secondary structure (SS), which is an im¬ 
portant determinant of the 3D structure and biological 
activity of ncRNA molecules, and so can be treated as a 
(simplified) phenotype in its own right [Hll] (Fig. [^. 

Fast algorithms exist to predict the free energy min¬ 
imum SS for a given sequence, and these are thought 
to be fairly accurate, especially for shorter strands [TOl 
m- Moreover, extensive databases exist for functional 
ncRNA [Tj. For these reasons, this computationally 
tractable yet biologically relevant sequence to struc¬ 
ture mapping is uniquely suited for investigating ‘whole 
genotype-phenotype map’ properties that may point to¬ 
wards general principles relevant for a wider class of sys¬ 
tems, and has been extensively studied over the last few 
decades [HI [T2U20] . 

Nevertheless, the RNA sequence to SS GP map also 
suffers from the exponential growth of the size of genetic 
space, which has limited prior comprehensive GP studies 
to fairly small lengths, making direct comparison to evo¬ 
lutionary outcomes difficult. Here we show that many 
detailed properties of this GP map can, in fact, be cal- 
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Genotype (RNA sequence) 

CAAAAGUCUGGGCUAAGCCCACUGAUG 

AGUCUCUGAAAUGAGACGAAACUUUUG 



FIG. 1. Schematic of the mapping from an RNA se- 
qnence to a SS. Here for an L = 55 type III hammerhead 
ribozyme with three stems. Any sequence that folds to the 
same SS topology is considered to map to the same SS phe¬ 
notype. 

culated, even for lengths as long as L = 126, where the 
set of all possible strands would weigh more than the 
mass of the visible universe (see Methods). One reason 
this is possible is because the map is highly biased to¬ 
wards a small fraction of phenotypes that take up the 
majority of genotypes. This bias means that sampling 
over genotypes (which we will call G-sampling) gener¬ 
ates signihcantly different outcomes than sampling over 
the morphospace [3] of all shapes (phenotypes) (which 
we will call P-sampling). The existence of such highly 
peaked distributions in the mapping from genotypes to 
phenotypes, also explains, in a way that is reminiscent 
of statistical mechanics, why sampling a relatively small 
number of genotypes is enough to determine certain key 
properties of RNA SS. 

Perhaps most strikingly, we find that the distributions 
of several properties of natural ncRNA taken from the 
function RNA database (fRNAdb) [H], including the 
number of stems, the mutational robustness and the 
number of genotypes per SS phenotype, are very simi¬ 
lar to what we obtain from random sampling over geno¬ 
types, and significantly different from uniform sampling 
over phenotypes. This result does not mean that natu¬ 
ral selection can be ignored, but rather, as we will ar¬ 
gue below, that the mapping strongly prescribes which 
parts of morphospace are presented to natural selection 
as potential variation. Variation can only be selected if 
it arrives [22l [23] . 

2. Results 

2.1 P-sampling over phenotypes differs signifi¬ 
cantly from G-sampling over genotypes. 

We first analyse an exhaustive enumeration of all se¬ 
quences for L = 20 RNA, the largest system for which 
this has so far been accomplished. The «10^^ se¬ 
quences were folded with the Vienna Package [10], and 
map to Vp=ll,218 unique bonded SS and one trivial 
structure with no bonds, as their free-energy minima m- 



FIG. 2. Comparison of P-sampled and G-sampled dis¬ 
tributions to natural data for L = 20 RNA. The P- 

sampled Pp{Q.) (red diamonds) measures the probability dis¬ 
tribution for a phenotype to have a given NS size fl. It differs 
markedly from G-sampled Pq)!!) (blue circles), generated by 
random sampling over genotypes. Error bars arise from bin¬ 
ning data. The black and cyan lines are theoretical approxi¬ 
mations to Pp{Q.) and Pg{0,) respectively (see Methods). The 
probability distribution of H for the SS all 7327 (non-trivial) 
L = 20 sequences for Drosophila melanogaster from the fR¬ 
NAdb database [21] (green squares) is much closer to the G- 
sampled Pg{D) than to the P-sampled Pp{fl). Inset: All 
11,218 SS phenotypes (purple triangles) ranked by NS size 
fl. There is strong bias, just 5% of phenotypes take up 58% 
of all genotypes. The 7327 natural data points (green squares) 
are clustered at lower rank (larger Q). 


The set of all sequences that map onto a given SS is called 
the neutral set (NS), and we use ft to denote the NS size 
(the number of sequences in the NS). This mapping ex¬ 
hibits strong phenotype bias |611I2|[I311I7H2Q]. For exam¬ 
ple, 12 varies by over ten orders of magnitude for L = 20 
(Fig. [2]). 

Next, we introduce the concept of P-sampling, that is, 
uniformly sampling over the set of possible phenotypes 
(the morphospace), and define Pp(12) as the probabil¬ 
ity distribution that a randomly chosen phenotype has 
NS size ft. We calculate distributions for fixed L and 
bin data uniformly in S' = log2^Q(12), but write Pp{n) 
and Pcift) for simplicity. Pp(fl) has a maximum when 
ft is about half of the maximum value of the exponent 
log(fl) = 17 « 10 (Fig. [^. We ignore the trivial struc¬ 
ture with no bonds for which ft = lO'^. For L — 20, 
T « 11.56; for larger L, T/U —>■ 1 and the probability of 
finding the trivial structure tends to zero (see Methods). 

Instead of P-sampling, one could also sample uni¬ 
formly over sequences (genotypes), which we refer to as 
G-sampling, giving Pc{ft) oc ftPp{ft) which highlights 
structures from the large ft tail of Pp{ft), as can be seen 
in Fig. 

Novel variation does not arise by uniform random sam¬ 
pling in the morphospace of all physically permissible 
phenotypes (P-sampling), but instead by processes such 
as mutation that change genotypes. While evolutionary 
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FIG. 3. Neutral set size distributions illustrate how bias constrains natural RNA secondary structnres. Sampled 
distributions Pp(n) (red diamonds) and Pg( 0) (blue circles) and analytic approximations to Pp{Q) (black lines) and Pg{^) 
(cyan lines) are compared to ncRNA from the fRNAdb database (green squares). For each length (denoted in the top 
corners) the data is shown both on a lin-log and log-log graph to highlight different parts of the distributions. The natural 
data is remarkably clsoe to the G-sampled Pg{0,), but quite far from the P-sampled Pp{Q). The number of natural structures 
plotted are: 7327 for L — 20 (which is repeated from Fig 2. for clarity of comparison); 658 for L = 40; 472 for L = 50; 350 for 
L = 60; 2263 for L = 70; 553 for L = 80; 731 for L = 90; 891 for L = 100; and 291 for L = 126. In each case all structures 
in the fRNAdb |21| database are used, except for L = 20 where only structures for Drosophila melanogaster are used, and for 
1/ = 55 and L = 126 where we also plot a smaller curated data set of 213 and 172 structures respectively (magenta triangles) 
where the SS is known to be important (see Supplementary Information). Smaller numbers of data lead to larger binning error 
bars, but the curated sets are clearly very similar to the full set of fRNAdb structures. 


dynamics do not proceed by simple uniform G-sampling 
either, recent detailed population genetics calculations 
for RNA [20] have shown that the rate at which novel 
variation (a particular SS) arises in an evolving popu¬ 
lation is, to first order, directly proportional to the NS 
size of the SS phenotype, and so also to Paifl). This 
proportionality holds for a wide range of mutation rates 
and population sizes. It was also demonstrated explic¬ 
itly that strong phenotype bias can overcome fitness dif¬ 
ferences in an RNA system [20]. Bias should therefore 
affect outcomes for multiple evolutionary scenarios. In¬ 
deed, important prior studies have suggested that natu¬ 
ral RNA could have larger than average ft [IZIIIHI. We 
took every L = 20 sequence in the fRNAdb database [H] 
for Drosophila melanogaster (see Methods and Supple¬ 


mentary Information) and calculated the NS size ft of its 
associated SS using the neutral set size estimator (NSSE) 
from ref. [18]. Not only are SS with larger ft overrepre¬ 
sented, but the entire natural distribution is remarkably 
close to the genotype sampled Pcift) (Fig. [^. 

2.2 Sampling for lengths up to L = 126. 

While these results for L = 20 are suggestive, distri¬ 
butions for larger length RNA are needed to ensure we 
are not just observing database biases or artefacts of the 
short length. Since the number of sequences grows ex¬ 
ponentially with increasing L, exhaustive enumeration is 
not an option for lengths much larger than L = 20. In¬ 
stead, we estimate the NS size distributions by randomly 
sampling genotypes, folding them into a SS, and then 
measuring their NS size with the NSSE (Methods). This 
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process naturally generates Pg{^); -Pp(14) can be backed 
out by dividing by 14 and normalising. However, it is hard 
to sample SS with small H so /p(H) is only partially de¬ 
termined. To make progress we use a simple analytical 
ansatz based on a log-binomial approximation to the dis¬ 
tributions (see Methods), which works well both Poi^) 
and Hp(H) for L = 20 (Fig. [^. We compare this ap¬ 
proximation to sampled data for L = 20 up to L = 126 
(Fig. I and Supplementary Figs 1,2). In each case the 
analytic fit to Pci^) is excellent, and the fit to Pp(H) 
works well, giving confidence that this form provides a 
reasonable approximation to the full Pp(H). 




FIG. 4. Rank plots for L = 55 RNA. Q v.s. rank plot for 
all Np « 4 X 10^^ possible SS structures for L = 55 shown 
as as (a) log-lin and (b) log-log plots. The black line is from 
the analytical approximation and the green squares denote 
the natural data. The horizontal red dot-dashed line is H = 
10®'^ « 3 X 10^®, near the peak of the Pg(H) distribution 
for L=55 in Fig 3. The 13 L = 55 hammerhead ribozyme 
structures from the fRNA database (red squares) are clustered 
near this peak. The vertical blue dotted line in (b) denotes 
l3Np = 2^ ~ 4 X 10®, the “effective” number of SS. This set 
of /3 ~ 0.1% of all structures captures the majority (« 75%) 
of natural structures. 

2.3 Entropy and the bias ratio. 

The Pg(H) distributions can be further quantified by the 
Shannon entropy H = — P{Pk) log 2 {P{Pk))^ where 

P{pk) is the probability of choosing one of the Np pos- 




FIG. 5. Natural stack distributions for ncRNA corre¬ 
late with G-sampling but not with P-sampling. (a) 

Distribution of the number of stacks for L — 55 via G- 
sampling (blue circles) and P-sampling (black diamonds). 
Natural data (green squares) are remarkably close to the G- 
sampled distribution, and are drawn from a tiny fraction of 
the full morphospace of structures (the P-sampled distribu¬ 
tion) with small numbers of stacks. The red diamonds are 
from a combinatorial estimate of stack number |24| that helps 
corroborate our P-sampling result, (b) Comparison to exper¬ 
imentally measured structures: The brown squares denote 
the number of stacks experimentally determined for 214 non- 
pseudoknotted structures with L > 20 which were taken from 
RNA STRAND database [^. The cyan diamonds show Kc, 
the mean number of stacks calculated by G-sampling with 
the Vienna package [ID]. which can be accurately fit with 
Kg = 0.074L — 0.377. The blue diamonds show the G- 
sampled number of stacks one standard deviation above or 
below from the mean. The natural data (brown squares) 
from the RNA STRAND database are consistent with the G- 
sampled theoretical data. By contrast these natural data are 
far away from the expected number of stacks from P-sampling, 
shown here for an estimate that uses the linear relationship 
between the P-sampled distribution of log(H) and K (Supple¬ 
mentary Fig. 6) to infer Kp (red diamonds), which are well 
described by a linear fit Ap = 0.177L — 0.443 (solid red line). 
Independent estimates of Ap come from reference [24], and 
include an asymptotic measure Kp « 0.1717L (dash-dotted 
line) (see Supplementary Information). The close agreement 
between the two independent methods for estimating the 
mean number of stacks from P-sampling gives us further con¬ 
fidence in our fits to the full P-sampled distribution. 
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FIG. 6. Natural robustness distributions for ncRNA 
correlate with G-sampling but not with P-sampling. 

The distribution of robustness, defined as the fraction of mu¬ 
tations r that retain the same SS phenotype is given for 
L = 55 via P-sampling (black line) and G-sampling (blue 
circles). Natural data (green squares) are remarkably close 
to the G-sampled distribution, and are considerably more ro¬ 
bust than the average of structures in the full morphospace. 
We also note that most phenotypes have a robustness that is 
above the threshold {r > 5 ~ 0.0061) needed for the formation 
of connected neutral networks | 26| . 

sible phenotypes pk by G-sampling. The exponential of 
the entropy, 2^, is used in statistical physics and infor¬ 
mation theory m as a measure of the effective number 
of states. Since P{pk) = it follows straightfor¬ 

wardly that 

2 -^^ = ( 1 ) 

where Sq is a G-sampled average of S' = log(n). Thus 
to measure the ‘effective number’ of phenotypes that 
take up the majority of genotypes, one only needs to 
determine Sc- This can be done rapidly because the 
G-sampled distributions are sharply peaked in a manner 
reminiscent of statistical mechanics m- 

As an example of how the highly peaked nature of these 
distributions facilitates the calculation of averages, con¬ 
sider the case of L = 126. We find that Sc ~ 51.5 with 
standard deviation ctg « 2.1 so that 95% of SS structures 
found by G-sampling have S in range 51.5 ± 4.2 which 
is very narrow compared to the full range [0,64.5], and 
comparatively far (about 6 standard deviations) from the 
phenotype averaged 5p « 38.6. Rather strikingly, this 
implies that even though the set of all possible L = 126 
nucleotide sequences would weigh more than the mass of 
the observable universe (see Methods), sampling the NS 
size for just 10 randomly chosen L = 126 structures is 
enough to fix the exponent Sc to within about 1.3% rel¬ 
ative accuracy, which allows us to determine, for example 
the number of relevant states 2^ via Eq[Tl 

To further quantify the bias, we introduce the bias ratio 
P G (0,1] as 

( 2 ) 


which can be interpreted as the ratio of the effective num¬ 
ber of phenotypes that take up most of the sequences to 
the total number of phenotypes Np. For example, if all 
n are equal then 2'^=2*°S2(^j=)=]Vp, /3 = 1, and bias is 
weak. On the other hand, if just one phenotype accounts 
for nearly all the genotypes then 2^«2°=1, /3 —>■ 0 and 
bias is very strong. 

The number of phenotypes Np can be estimated from 
Pp{fl); Np « 0.13 X 1.76^ fits the data well (see Meth¬ 
ods). From this we find /3 «0.25x0.91^ which shrinks 
with increasing L. Typically we find that about 75% 
of genotypes map to the ‘effective number’ PNp = 2^ of 
structures. Nevertheless, PNp = 2^ ra 0.033x1.60^, and 
continues to grow exponentially. For L = 55, /3 « 0.001 
and PNp « 4 x 10® of the Np « 4 x 10^® phenotypes 
take up about 75% of the genotypes. For L = 126, 
/? « 2 X 10“® and PNp « 1 x 10®^. Even though bias 
greatly reduces the effective number of phenotypes ac¬ 
cessible to mutations, their number continues to grow 
exponentially, and so can still be extremely large for this 
system. 

2.4 Comparison to fRNAdb database of func¬ 
tional non-coding RNA. 

To explore how this bias plays out in nature, we took, 
for lengths ranging from L = 20 to L = 126, every se¬ 
quence in the fRNAdb database [3T] and calculated the 
NS size n of its associated SS. The correspondence be¬ 
tween Pc{i2) from random sampling and the distribution 
of n from the database is striking (Fig.j^and Supplemen¬ 
tary Figs. 1,2). Not only are these natural RNAs mainly 
drawn from the minuscule fraction /3 of SS with large O, 
but their distribution is remarkably close to what one 
would obtain from randomly sampling genotypes. Rank 
plots for L = 55 (FigQ further illustrate how natural 
ncRNA are constrained mainly to the set of PNp = 2^ 
‘relevant structures’. 

These results are, at first glance, surprising because 
many ncRNAs have well-defined functional roles and so 
will have been subject to natural selection for which 
there exists extensive evidence [7]. Some insight may be 
gleaned from an exchange that occurred not long after 
the start of the molecular revolution in biology. Frank 
Salisbury |4j expressed doubt that evolutionary search 
could work, since genetic spaces are typically exponen¬ 
tially large, and, he claimed, are probably sparsely pop¬ 
ulated with functional phenotypes. In an famous re¬ 
sponse [5], John Maynard Smith pointed out that evo¬ 
lution is aided by neutral mutations because these imply 
many fewer phenotypes than genotypes and also connect 
genotypes into neutral networks that can be explored by 
evolving populations. Phenotype bias may further facili¬ 
tate evolutionary search by limiting potential outcomes. 
But taken together, this still does not counter the heart 
of Salisbury’s objection, namely that functional pheno¬ 
types are extremely rare and so hard to find. 

Here we instead suggest that the reason the ‘null 
model’ Pc{^) predicts what is found in nature so accu¬ 
rately is that ‘good enough’ SS are relatively easy to find. 


P = 2^/Np « 0.25 X 0.91^ 
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Of course the full functional ncRNA phenotype typically 
has a smaller NS size than the SS does because it requires 
further constraints on the sequence to achieve function¬ 
ality [S]. We postulate that while natural selection cre¬ 
ates function by acting on such sequence constraints, it 
also automatically draws from a palette of easily acces¬ 
sible SS variation that is strongly pre-sculpted by the 
mapping (see also ref. [T3] for a similar suggestion). An¬ 
other possibility might be that the correspondence with 
Pg{^) simply means that the SS has no adaptive value. 
However, this scenario is unlikely, since it is well estab¬ 
lished that the SS is an important determinant of the 
3D structure la 0 i which, in turn, helps determine 
function. Moreover, the correlation with Pg{^) remains 
strong when we curate the database for structures where 
SS is known to be important. Altogether this suggests 
that RNA SS that facilitate biological function are - con¬ 
tra Salisbury - not rare, at least not within the set of ‘rel¬ 
evant structures’ most easily accessible to random muta¬ 
tions. 

Salisbury’s arguments are also undermined by in- 
vitro evolution experiments that selected random RNA 
strands for self-cleaving catalytic activity and found that 
the hammerhead ribozyme repeated emerged, suggesting 
convergent evolution |28j . The hammerhead ribozyme 
appears so frequently in all three kingdoms of life that 
it has recently been termed ubiquitous |29j . We exam¬ 
ined the 13 hammerhead ribozymes from the natural 
data set [2T] for L = 55, finding that S = log(H) « 
23.87 ± 0.64 1^ very close to the peak of the G-sampled 
distribution at Sq ~ 23.4 (see L=55 panel in Fig. and 
Fig. 4). We postulate that evolutionary convergence is 
observed in these experiments and in nature not so much 
because the hammerhead ribozyme is htter than other 
possible self-cleaving enzymes, but rather because it is 
particularly easy to find. There may even be many other 
self-cleaving ribozymes among the 99.9% of L = 55 struc¬ 
tures that evolution is unlikely to search through . It 
would in fact be interesting to devise artificial methods 
to search for such undiscovered ribozymes [30l El] . 

2.5 Distribution of stacks. 

Are the 2^ relevant structures different from the whole 
set of Np possible structures? Our arguments above sug¬ 
gest that this should be the case for properties that cor¬ 
relate with neutral set size. Previous studies (typically 
on much smaller data sets) have found that structural 
features (e.g. distributions of stack and loop sizes) of the 
natural and random SS are quite similar m. and that 
natural and random rRNA share strong similarities in 
the sequence nucleotide composition of SS motifs such as 
stems, loops, and bulges m, although natural RNA are 
more stable than random RNA M- Indeed we find that 
the natural RNA have slightly more bonds than in G- 
sampled structures (see Supplementary Information Fig. 
S3 (c),(d)). 

We find that il correlates negatively with the num¬ 
ber of stacks (i.e. sets of contiguous base-pairs) K (see 
Supplementary Information Fig. S6). The natural distri¬ 


bution of stacks closely follows the G-sampled distribu¬ 
tion, but differs markedly from the P-sampled distribu¬ 
tion (Fig 4a). For example, the hammerhead ribozyme 
has 3 stems, close to the most likely number by ran¬ 
dom G-sampling for L = 55, but much less than the 
P-sampled average of « 10. Bias means that it will be 
difficult for evolution to find L = 55 structures with a 
large number of stacks, again raising the question of what 
kind of functionality is possible in principle that cannot 
be reached by evolution because of such phenotype bias 
constraints. 

As an independent check on our stack predictions, we 
also obtained 214 natural experimentally determined SS 
from the STRAND RNA database [5S] and plot the ex¬ 
perimentally determined number of stacks versus length 
L in Fig. The majority of experimentally deter¬ 
mined structures have numbers of stacks that are within 
one standard deviation of the mean calculated from G- 
sampling, as one would expect if our theoretical predic¬ 
tions were accurate. By contrast, the number of exper¬ 
imentally determined stacks differs signihcantly from P- 
sampling estimates. 

2.6 Distribution of mutational robustness. 

Interestingly, the bias towards larger ft also leads to 
structures with larger mutational robustness Eiiia, and 
again natural data closely follow G-sampled distributions 
(Fig. 1^ Larger robustness is considered to be advanta- 
geous^0 so that in this important way phenotype bias 
facilitates evolution. It is also interesting to note just 
how large the mutational robustness is for this data. For 
L = 55 there are on the order of Np « 8 x 10^^ phe¬ 
notypes, so that the mean probability of finding a phe¬ 
notype by randomly picking a genotype is on the order 
of 1 X 10“^^. Instead, for both the G-sampled and the 
P-sampled structures, the probability of a nearest neigh¬ 
bour generating the same phenotypes is on the order of 
10^^ times higher than random chance. As recently em¬ 
phasised in ref ESI this large difference arises from ge¬ 
netic correlations, and typically lifts the robustness well 
over the minimal threshold S = 1/3L [S] (5 « 0.0061 
for L = 55) needed to generate large connected neutral 
networks. 

3. Discussion 

By solving properties of the GP map from sequences 
to RNA secondary structures for strands up to L = 126 
nucleotides in length, we show explicitly that the vast ma¬ 
jority of sequences map to an exponentially small fraction 
of all possible phenotypes (a summary of scaling forms for 
key properties of RNA can be found in Tablejl]). One con¬ 
sequence of this strong bias is that only an exponentially 
small proportion of the morphospace of possible struc¬ 
tures can ever be presented to natural selection. Even 
if one could re-run the tape of life over again multiple 
times, many structures that are physically feasible and 
probably biochemically functional are extremely unlikely 
to appear simply because they are inaccessible to evolu¬ 
tionary search. 
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While the existence of bias in the RNA GP map has 
been known for quite some time [6l [I 2 H 2 Q], and stud¬ 
ies using smaller amounts of natural data have sug¬ 
gested that aspects of the mapping are reflected in na¬ 
ture [13 m EMU], our ability to calculate GP map 
properties for much larger L allows us to make a com¬ 
prehensive comparison to the fRNAdb database for func¬ 
tional ncRNA m- While one might expect that, due to 
bias, large structures are relatively more plentiful in 
nature [niEHi, perhaps the most surprising result of this 
study is just how closely the natural data follows the pre¬ 
diction of uniform G-sampling over genotypes. We find 
that the distribution of neutral set size 17, the number 
of stacks S and the mutational robustness r of naturally 
occurring ncRNA all closely follow the G-sampled dis¬ 
tributions, and deviate significantly from the P-sampled 
distributions. The distribution of bonds also deviates 
strongly from the P-sampled distribution, but in contrast 
to the other distributions above, it also exhibits a small 
deviation from the G-sampled distribution, with natural 
RNA having slightly more bonds [T4| (see Supplementary 
Information Fig S3 (c),(d)). 

Why does it appear as if we can virtually ignore nat¬ 
ural selection with G-sampling? We postulate that even 
though the number of relevant structures remains ex¬ 
tremely large, secondary structures that are good enough 
for function are nonetheless abundant, most likely be¬ 
cause small differences between them don’t matter that 
much. Instead, some broader coarse-grained structural 
features are likely to be sufficient for many functional 
roles. We suggest that natural selection works on this 
pre-sculpted variation mainly by further refining parts 
of the sequence. Some evidence for this picture can be 
gleaned by the fact that natural structures have slightly 
more bonds than G-sampled RNA structures do, suggest¬ 
ing selection for greater thermal stability [14] . 

Interestingly, the fact that many properties of ncRNA 
SS so closely follow the G-sampled distribution is direct 
evidence against claims that the hyper-astronomically 
large size of genotype space makes functional structures 
virtually impossible to find. Such assertions were perhaps 
most famously made by Frank Salisbury [4] over 45 years 
ago. Despite an illuminating response by John Maynard 
Smith [ 5 ], and much experimental evidence to the con¬ 
trary, such “arguments from large numbers” remain a 
popular trope in anti-evolutionary polemics today. 

The ease with which we can calculate distributions of 
RNA properties suggests an analogy to the concept of 
ergodicity in statistical physics, which means that en¬ 
semble averages are equivalent to time-averages. When 
ergodicity (approximately) holds, then it is often true 
in practice that sampling (via computer simulation for 
example) a relatively small number of ’’typical” states 
(small compared to the total number off possible mi¬ 
crostates) is sufficient to accurately calculate ensemble 
averages of key properties. Something akin to ergodicity 
may be operating in our case case because statistical av¬ 
eraging via G-sampling is close to the snapshot of many 


trajectories over time that we observe in the database, 
the latter being analogous to a time-average. On the 
other hand, evolution is not an equilibrium process. In¬ 
deed, the exponential bias in the GP map suggests that 
evolutionary waiting times for more rare phenotypes to 
appear also grows exponentially [20] ■ Thus the number of 
novel phenotypic possibilities will continue to (slowly) in¬ 
crease with time. With large waiting times contingency 
is more likely play an important role. So if a certain 
evolutionary change is predicated on one of these rare 
phenotypes fixing, then the process of waiting for these 
innovations may be non-ergodic. 

The suggestion that biases in development or other in¬ 
ternal processes could strongly affect evolutionary out¬ 
comes has traditionally been highly contentious, see 
e.g. refs. [33 and [33 for an overview. For example, in 
a recent exchange, entitled: “Does evolutionary theory 
need a rethink?”, Laland and colleagues [34] argue in 
favour of the thesis by, amongst other things, advocating 
for the importance of developmental bias. In their rejoin¬ 
der, Wray, Hoekstra and colleagues, write [35]: “Lack of 
evidence also makes it difficult to evaluate the role that 
developmental bias may have in the evolution (or lack of 
evolution) of adaptive traits”, and call for new evidence: 
“The best way to elevate the prominence of genuinely in¬ 
teresting phenomena such as ... developmental bias ... is 
to strengthen the evidence for their importance.” While 
the RNA system we study here is much simpler than a 
typical developmental system, that vice is also a virtue 
because it allows us to make detailed calculations of the 
whole morphospace which can then be closely compared 
to natural data. Thus RNA secondary structure provides 
perhaps the clearest and most unambiguous evidence for 
the importance of bias in shaping evolutionary outcomes. 

Bias in the GP map constrains outcomes and so natu¬ 
rally suggest one mechanism for homoplasy (where sim¬ 
ilar biological forms evolve independently) [36] • The 
causes of homoplasy are sometimes elaborated in the 
context of the difference between parallel evolution, 
where homoplasy is thought to occur because two organ¬ 
isms share a common genetic heritage, and convergence 
proper, where the same solution is found by different ge¬ 
netic means, and where the primary causal force is usu¬ 
ally attributed to selection. While this binary distinction 
may be too simplistic (see e.g. refs 1331 - 1331 for some recent 
discussion), very roughly, parallel evolution is thought 
to be considerably weaker evidence than true convergent 
evolution is for the idea that re-running the tape of life 
would generate similar outcomes. Interestingly, the bias 
in the GP map discussed here does not fit into this two¬ 
fold demarcation at all. Re-run the tape of life, and as 
long as RNA graces the replay, so will a very similar 
suite of molecular shapes. But the reason for this repeti¬ 
tion is not a contingent common genetic history, nor the 
Allmacht of selection [40], but rather a different kind of 
‘deep structure in biology’ [4T] . 

Our ability to make detailed predictions about evo¬ 
lutionary outcomes as well as counterfactuals for RNA 


may also shed light on Mayr’s famous distinction between 
proximate and ultimate causes in biology [42] , which has 
been the subject of much recent debate in the litera¬ 
ture [33]. This distinction has historically been used to 
argue against the role of developmental bias in determin¬ 
ing evolutionary outcomes (see e.g. |33| and |44|). While, 
as also mentioned above, the RNA GP map is much sim¬ 
pler than a typical developmental system, it is instruc¬ 
tive to consider how the bias described in the current 
paper plays into Mayr’s ultimate-proximate distinction: 
For example, what is the cause of convergent evolution 
of the hammerhead ribozyme [33] ? The ultimate cause 
that this ribozyme emerges in populations or in in-vitro 
experiments is surely natural selection for self-cleaving 
catalytic activity. But why is a 3-stack structure re¬ 
peatedly found, and not say a 10-stack structure, even 
though the latter are much more common in the mor- 
phospace? The cause here is not natural selection per se 
since it is unlikely that an efficient 10-stack ribozyme is 
biophysically or biochemically impossible to make. In¬ 
stead the explanatory force [H] |45] for the “why” ques¬ 
tion is mainly carried by a proximate GP mapping con¬ 
straint, namely that the frequency with which 10-stack 
structures arise as potential variation is many orders of 
magnitude lower than the frequency with which 3-stack 
structures do. Since the fittest can only be selected and 
survive if they arrive in the first place |33J [33] , the evolu¬ 
tionary mechanism that leads to convergence here might 
be better termed the arrival of the frequent [20] . 

We also note that the mapping constraint described 
here differs from classical physical constraints, which 
would act on the whole morphospace, and from phyletic 
constraints, which are contingent on evolutionary histo¬ 
ries |4t>j . This mapping constraint has some resemblance 
to classical morphogenetic constraints which also bias the 
arrival of variation m- But it also differs because the 
the latter are conceptualised at the level of phenotypes 
and developmental processes, and may have been shaped 
by prior selection, while the former constraint is a funda¬ 
mental property of the mapping from genotypes to phe¬ 
notypes and was not selected for (except perhaps at the 
origin of life itself). 

Finally, strong phenotype bias is also found in model 
GP maps for protein tertiary [H] [49] and quaternary 
structure m, gene regulatory networks [SJ |S3], and 
development [53], suggesting that the some of the re¬ 
sults discussed in this paper for RNA may hold more 
widely in biology. While all these GP maps only cap¬ 
ture a tiny fraction of the full biological complexity of 
an organism, if the bias is also (exponentially) strong, 
then its effects on the rate with which novel variation 
appears are likely to persist even when further biological 
details are included. Although more evidence needs to 
be gathered before making firm pronouncements, it may 
well be that we need to let go of the commonly held ex¬ 
pectation that variation is isotropic in phenotype space 
or morphospace [54] . Or to put it more provocatively, 
perhaps our null models should by default assume that 


variation is highly anisotropic and biased towards certain 
outcomes over others, unless there is direct evidence to 
the contrary. 


4. Methods 

Folding RNA structures To map a sequence to a SS, 
we use the Vienna package m with all parameters set 
to their default values (e.g. the temperature T = 37°G). 
This software employs dynamic programming techniques 
to efficiently fold sequences based on thermodynamic 
rules. Methods of this type are widely used, have been 
extensively tested, and are thought to be relatively ac¬ 
curate. For example, a related method [55] was recently 
shown to correctly predict 73 ±9% of canonical base pairs 
for a database of known RNA structures up to lengths of 
700 nucleotides. Generally these methods are expected 
to work better for shorter strands |56| . and should work 
well for the lengths we explore. However, they typi¬ 
cally cannot correctly predict pseudoknots. Knowledge 
based methods that also take input from known struc¬ 
tures and other information may be more accurate for 
predicting the structures of individual sequences [57], 
but such methods could introduce biases for genotype- 
phenotype maps because they take input from natural 
structures. Thermodynamically based methods such as 
the Vienna package are therefore probably better suited 
for working out global properties of the entire genotype- 
phenotype map, including structures that have not (yet) 
been found in nature. For these reasons, this package 
has been most frequently used in studies of full genotype- 
phenotype maps [6] [T2U2n] . To check our Vienna package 
results, we compared RNAstructure package [55] calcu¬ 
lations for the G-distributions of stacks and bonds (Sup¬ 
plementary Figure 3), finding very similar distributions. 
We also compared the two packages for other motifs like 
bulges, loops, and junctions, finding again very similar 
predictions (graphs are not shown). 

Exponential growth of the number of strands with 
strand length L To illustrate how rapidly the number 
of sequences grows with length consider the following: 
There are L4^ nucleotides in the set of all possible se¬ 
quences of length L. The mean mass of a single RNA 
nucleotide is about 5 x 10“^^ kg so that, for example, the 
set of all L = 55 strands weighs about 3.8 x 10^° kg, the 
set of all L = 79 strands weighs about 1.5 x 10^® kg, or 
about 2.6 times the mass of our Earth, while the set of 
all L = 126 RNA strands would have an almost unimag¬ 
inably large mass of about 5 x 10®^ kg, more than the 
mass of the observable universe which is estimated to be 
about 10®^ kg. 

Generating distributions P^)!!) and Pp(r2) by 
sampling We used a standard (Python) random number 
generator to create sets of random sequences. For each 
sequence we used the Vienna package to find the lowest 
free energy SS. To determine H for each SS we used the 
neutral set size estimator (NSSE) described in ref. [TH] 
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which employs sampling techniques together with the in¬ 
verse fold algorithm from the Vienna package. We used 
default settings except for the total number of measure¬ 
ments (set with the -m option) which we set to 1 instead 
of the default 10, for the sake of speed. We checked that 
this has a negligible effect (typically < 1%) on the ac¬ 
curacy of our distributions. We also checked the NSSE 
against the full enumeration for L = 20, finding an agree¬ 
ment of = 0.97 for structures with larger than the 
average; it performs slightly less well for rare structures. 
For longer lengths the number of samples were: 10® for 
L = 30, 3 X 10® for L = 40, 20,000 for L = 35 - 80, 5000 
for L = 85 — 100 and 1000 for L = 126.. Sequences that 
generate the trivial structure are discarded. A small frac¬ 
tion of sequences (which increases with increasing length) 
were also discarded due to the inverse folding package 
failing to converge (See Supplementary Information). 

To generate the Pg{^) distribution, we partition the 
support of the distribution into bins which are uniform 
on an ^ = logio(n) scale. We then determine the prob¬ 
ability mass Pai^) in each bin. Error bars are simply 
statistical: There is a tradeoff between making smaller 
bins to give a greater resolution and minimising statis¬ 
tical errors that increase when there are fewer measure¬ 
ments per bin. Bins with too few sampled points were 
typically not included in the graphs to avoid large er¬ 
ror bars. Pp{ft,) is generated from the sampled data by 
dividing though by (measured at the midpoint of the 
bin). Pp(n) is normalised with the Np calculated from 
the analytic approximation to Pp(Q) (see below). 

Analytical fit to NS size distributions For analytic 
fits we make a simple log-binomial ansatz: 

Pp(fl)= (^^^(pp)«(l-pp)^-9 (3) 

where q = \og{^)N/U, 10^ is the largest non-trivial fl, 
and N and pp are parameters that are fit to measured 
distributions. In other words the probability that a P- 
sampled SS is found with S = log(il) is distributed bino- 
mially. By definition Pg{^) oc ^lPp{^l). Taking normali¬ 
sation into account is enough to show that Pg{^) has the 
same binomial form as Eq. (1^, but with parameter pp 

replaced by pg = (ppl0^/^)^l— pp-|-ppl0^/^). Fixing 
the parameters N, U and either pG or pp thus fixes both 
distributions. 

For L = 20, Eq. ^ with U = 10, N = 8.0 and 
Pp = 0.55 describes the exact Pp{il) from full enumer¬ 
ation very well, as can be seen in Fig. lb. The related 
approximation for Pg(H) performs slightly less well for 
G-sampled data for L = 20, but it still captures the main 
qualitative features. 

For larger L we determine the parameters as follows: 
First, we estimate U from the largest non-trivial NS size 
found. This method inevitably provides a lower bound 
U' on the true maximum U. However, given the rather 
sharp upper bound generated by the binomial fit, we ex¬ 
pect that the relative errors in U are quite small. For 


example, for L = 60 we used 20,000 samples to deter¬ 
mine U' = 30.56. From the binomial form we estimate 
that U' is within an error 5U = 0.45 of the true U with 
90% probability. Next we calculate Sg, the G-sampled 
average of S' = log(H), as well as the standard deviation 
ctg in S from G-sampled data. We can then determine 
the parameter pg = Sg/U, derived by taking the mean of 
q through Eq. ([^. The parameter N can subsequently be 
extracted from the measured G-sampled standard devi¬ 
ation: aG = \/Apg(1 — pg)U/N. The P-sampled stan¬ 
dard deviation is given by ap = ^/Npp{l — pp)U/N. 
Since pp = Sp/U < pg, so also ap < aG- In this way we 
obtained the binomial fits shown in Figj^and Supplemen¬ 
tary Figs 1,2. The close agreement between the sampled 
data and our fits for lengths up to L = 126 suggests that 
this procedure is fairly robust. 

It is harder to find structures with small H, so that 
the Pp{fl) can only be partially sampled, especially for 
larger L. However, given how well our simple ansatz 
works for predicting Pg{^), and given that for L = 20 
RNA the binomial form works so well for the full range of 
structures, we expect the full Pp(H) to be at least similar 
if not very close to Eq. ([^ . Further evidence for this form 
can also be extracted from combinatorial arguments for 
the distributions of stacks [21] , which are correlated with 
log(fl) (see below). 

Rank plots Analytic rank-plots functions are the cumu¬ 
lative density function of Pp{ft). For L = 20 all struc¬ 
tures are known, so a rank plot can be directly made. For 
L > 20 the measured ft of natural SS were used to align 
points to a rank function calculated from the analytic 
binomial fit to Pp{fl). 

Scaling forms as a function of L We further used 
the lengths L = 30 — 100 to extract scaling forms for 
several properties as a function of L. Linear fits in L are 
shown for U, T, Sg and N in Supplementary Fig. 5. T 
is close to U so that the relative difference between T 
and U decreases as L increases (Supplementary Fig. 5). 
A summary of the scaling forms for different properties 
in the large L limit can be found in Table [^ 

Probability Px to find the trivial structure The 

probability of finding the trivial structure, Pp, decreases 
exponentially with increasing L: 

Pp = 10'^/4^ « 21.4 X 0.82^ (4) 

For example, for L = 20 we can directly measure Pp « 
33% whereas for L = 55 it has already dropped down 
to a mere 0.04%. This rapid decrease in Pp justifies our 
decision to ignore the trivial structure in our fitting to 
Pg{S) and Pp{S). 

The number of SS, Np, as a function of L For 

L = 11 — 20 we used full enumerations [20] to calculate 
Np. For longer L we used our analytic fit as follows: The 
mean (including trivial structure) of ft is /Np so that 
(1 — Pp)A^/Np = Given the binomial 

form of Pp{ft), the sum can be carried out analytically. 
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Quantity 

Large L scaling form 

Total number of genotypes 

Ng = 

.4^ 

Total number of SS phenotypes 

Np s 

s 0.13 X 1.76^ 

Mean D 

4^/Ap « 7.7 X 2.27^ 

Largest non-trivial D 

10'^ : 

0.7 X 3.27^ 

12 for the trivial structure 

10"^ : 

a 21.4 X 3.29^ 

Probability to sample trivial structure 

Pt- 

i 21.4 X 0.82^ 

12 near peak for phenotype sampling 

10®’^ 

« 0.01 X 2.1^ 

12 near peak for genotype sampling 

lO^G 

« 30.2 X 2.5^ 

Shannon entropy of distribution 

H as 

0.675L - 4.92 

‘Effective number’ of SS phenotypes 

pNp 

« 0.032 X 1.60^ 

Bias parameter 

0 = 

/K ~ 0.25 X 0.91^ 

Np 

G-sampled mean number of stacks 

Kg = 

= 0.074Z/ - 0.377 

P-sampled mean number of stacks 

Kp = 

= 0.177L - 0.443 


TABLE I. Large L scaling of some key quantities for 

RNA SS. For the number of sequences for the NS of the 
largest non-trivial structure, defined as Q, — 10^, we find 
U = (0.514 ± 0.009)1/ — (0.20 ± 0.5), while for the trivial 
structure, with fl = 10^, we find T = (0.5166 ± 0.0009)L + 
(1.33±0.06) so that U becomes relatively more close to T as L 
increases. The G-sampled mean of S = log(f2) scales as Sg = 
(0.399 ± 0.0014)L + (1.48 ± 0.09). For large Sg « 0.78 U 
while for P-sampling, Sp « 0.63(7 so that Sp/Sq ~ 0.8. 
Similarly, the standard deviations of log(r2) can be directly 
calculated, and in the large L limit tend to apjU ~ Q.Z7jy/L 
and ogIV ~ 0.31/%/!/. This explains analytically what can be 
observed qualitatively in Fig. 3 and Supplementary Figs. 1,2: 
The Pg{^) distribution is slightly narrower than the Pp(n) 
distribution. As L increases both distributions become more 
sharply peaked relative to the total range [0,(7] and Pg{0,) 
highlights SS phenotypes that are deeper into the tails the 
Pp(fl) distribution (and vice versa). 

from which it follows that Np = (1 — Pt)4^/(1 — Pp + 
pplO^^^)^ . At each L we evaluated n/Pt,pp and (7, 
and used this to evaluate Np. A simple linear form 

Np « (0.13 ± 0.04) X (1.760 ± 0.007)^ (5) 

provides a good fit to the data, ( Supplementary Fig. 5). 
Scaling forms for the Shannon entropy H and bias 
parameter /3 as a fnnction of L From the expression 
for the entropy derived above it directly follows that H — 
2L — log2(10).§G. Using the fit derived above for Sq, the 
entropy grows with L as H k, 0.675L — 4.92 and the 
effective number of states scales as 2^ = 4^/10'®'^ « 
0.033 X 1.60^ (This ignores the trivial structure, but the 
effect is very small for larger L). Note that one only 
has to find Sg, which is easily obtainable, to determine 
this important quantity. By combining with Eq. ([^ we 
find that the bias parameter scales as /? = 2^ /Np « 
0.26 X 0.91^. 

Sampling natnral RNA from the fRNA database 

To generate the distributions of functional ncRNA we 
took all available sequences for each length studied (rang¬ 
ing from L = 40 to L = 126) from the non-coding func¬ 


tional RNA database (fRNAdb [21] )■ For L = 20 we took 
data from Drosophila melanogaster only, but this made 
up 77% of all L = 20 SS in the database. For each se¬ 
quence we found the SS and used the NSSE to estimate 
its n. A small fraction of the natural RNA sequences 
contained non-standard nucleotide letters, e.g. N or R; 
such sequences were ignored, because the standard pack¬ 
ages cannot treat them. Similarly, a small fraction of 
sequences were also discarded due to the NSSE failing to 
converge (see SI). We checked that there were no repeated 
sequences in the database of natural RNA. Finally, in the 
Supplementary Information we provide a breakdown of 
the identity of the structures for L = 20, 55, 70 and 126. 
and also take curated subsets of the data to emphasise 
structures where the SS is known to be important. In 
Fig. 3 and Supplementary Figs 1,2, we show for L = 55 
and L = 126 that the close correlation with Pg{^) re¬ 
mains when data is curated. 

Checking for codon bias Genetic mutations are ran¬ 
dom in the sense that they don’t arise to benefit an or¬ 
ganism. Nevertheless, it is well known that in other ways 
mutations are not uniformly random |58j . One example 
(among several) is that transitions (pyramidines O pyra- 
midines or purines O purines) are more frequent that 
transversions (purines -f-)- pyramidines) [58]. For many 
of these biases, mutations can still effectively sample the 
whole space uniformly without preferring certain geno¬ 
types over others. Nevertheless, there are biases that 
lead, for example, to an excess of GC over AT base pairs, 
or vice versa in DNA [^. To test for the effect of strong 
bias of this type we generated 0 distributions with 30% 
GC (AU bias) and 70% GC (GC bias) content for differ¬ 
ent lengths. The overall effect becomes less pronounced 
for longer strands (Supplementary Fig 4). Since natu¬ 
ral DNA (and by extension RNA) can show biases for 
both larger and smaller GC content, we argue that this 
can to first order be ignored when comparing to natural 
data sets across many species, although effects may be 
observable on datasets with large content bias. 
Calculating P-distribution of stacks We used a lin¬ 
ear relationship between log(D) and K (Supplementary 
Fig. 6) to transform Pp(Sl) into an estimated P-sampled 
distribution of stacks, as shown in Fig.4 a, and used this 
to obtain the P-sampled average Kp. We also adapted 
analytic results from ref [23] based on combinatorics (see 
SI) to calculate estimates for the P-sampled distribution 
of K and for Kp. The close agreement shown in Fig 4 
a,b, between the two independent methods for estimat¬ 
ing the number of stacks from P-sampling increases our 
confidence in our binomial fits to the full P-sampled dis¬ 
tribution. 

Calculating the robustness distribution A strong 
positive correlation between mutational robustness and 
D is well established in the literature [B]. For Fig 6, 
the robustness for G-sampled data (taken from 10® ran¬ 
dom sequences) and natural data (taken from the the 504 
natural sequences for L = 55), was calculated by folding 
all 3L strands within one point mutation and calculat- 
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ing the fraction r that generate the same SS. Error bars 
come from our binning procedure. For the P-sampled 
distribution such a sequence robustness can’t be defined, 
so instead the robustness per sequence was hrst averaged 
over a set of sequences for each secondary structure to 
estimate a mean robustness (phenotype robustness) per 
SS. We next generated a cubic ht to the mean r per SS 
v.s. logfl. The fit was constrained to have r = 0 at 
= 1 and had = 0.93. This was then combined with 
Pp(fi) to generate an estimate of the r distribution for P- 
sampled data. Since we don’t have many data points at 
small n, the ht is partially an extrapolation. Further er¬ 
ror may arise from the partial sampling of the phenotype 


robustness. Nevertheless, we don’t expect this procedure 
to lead to a signihcant difference in the qualitative com¬ 
parison between P- and G-sampled data for robustness, 
given that the two are so different. 
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Supplemental Materials: The structure of the genotype-phenotype map strongly 
constrains the evolution of non-coding RNA 

I. SUPPLEMENTARY METHODS 

Correlation between number of stacks and fl In Supplementary Fig. S6 we plot the number of stacks (i.e. 
contiguous base pairs), AT, in sampled SS, which is linearly correlated with log(O). Only two lengths are shown, but 
the correlation is very similar for all lengths. 

P-distribution of stacks from combinatorial analyses Another approach to calculating the P sampled number 
of stacks can be found in an important paper by Hofacker et al |24j . who analyse all possible SS under various 
constraints. Their arguments essentially count all ways of connecting bonds together, given physical constraints such 
as a minimum loop size m. This set is not the same as the set of all structures that sequences fold to, as there 
may be structures that are not the lowest free-energy structure for any sequence at the temperature investigated. 
Nevertheless, It is instructive to compare the distributions they derive to our P-sampling results. 

More specifically, Eqs. 6 and 7 of Hofacker et al [24] give a recursion formula for the number of SS of length L with 
K stacks, denoted Ni^{K) : 

L-l K 

Nl+i{K) = Nl{K) + EE Zk+2{l)NL-k-i{K-l) (SI) 

k—m l—O 

for K > 0 and L > m + 1 (where m is the minimum loop size), and 

NLiO) = l,NLiK)=0,K >0,L<m + l (S2) 

Here Zl{K) is an auxiliary variable defined as 

Zl^K) = Zl-2{K) + Nl-2{K - 1) - Zl-2{K - 1) (S3) 

with Zo{K) = Zi{K) = 0. 

These recursion relationships can be solved to find the estimated stack distribution as a function of length. We 
note that in order to implement their recursion formulas we had to make two simple minor adjustments to the 
original formulae, both of which were confirmed by Peter Stabler, one of the original authors of ref. [24] . in a private 
communication. The adjustments were to set Zl{Q) — 0, and to set ZL{b) — {) \i L < m + 2. With this in hand, we 
calculated these distributions using the constraint that the minimum loop size is m = 3, which is also a constraint 
in the Vienna package. We find that the distributions are unimodal and narrowly peaked. The peak is roughly 
mid-way between the minimum value of K (which is 0) and maximum value of AT, which we found to be « L/3. 
More precisely, we calculated the mean Kp for lengths 20... 200, which can be fit to the following linear relationship: 
Kp = (0.19152 ± 0.0001)L — (0.476 ± 0.001). The standard deviation can be fit quite accurately with axp = 0.17VX 
so that the width of the stack distribution, divided by the mean number of stacks, goes down as 1/vT for large L: 
the peaks become relatively more sharp. This scaling of the distribution width with L is the same as that found for 
Pp{cr) and for the P-sampled stack distribution described above an shown in Fig 3 of the main text. 

Hofacker et al [24] also gave analytic asymptotic large L estimates for the P-mean number of stacks. These estimates 
varied slightly depending on assumptions such as the value of m, and the type of bonding the RNA sequence was 
assumed to make. The most relevant of these estimates for our work is a P-mean which assumes the minimum loop 
size is TO = 3, and that permits G-U bonds. The corresponding large L estimate is Kp = 0.1717L. 

These two analytic approximations for Kp are fairly close to our estimate from combining the linear relationship 
between K and S = log(H) with Pp(il), namely Kp = 0.177L — 0.443. It is gratifying to find that these independent 
estimates, one using our empirical relationship between K and log(H) together with a fit to Pp(f2), and the others from 
combinatorics, are so similar. This gives us confidence that the P-sampled mean numbers of stacks is considerably 
larger than the G-sampled Kq^ which can be approximated as Kq = 0.074L — 0.377, and which is close to the mean 
number of stacks found in the RNA STRAND database for solved structures. All these different strands of evidence 
strengthen the case for our simple binomial fit approximation for Pp(r2), which we could only partially sample. 


II. FRNADB DATASETS 


In this section, we describe in more detail the contents the natural datasets for L = 20, 55, 70 & 126, taken from 
fRNAdb [^ database. 
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A. L = 20 

The full fRNAdb database [2T] had 14350 RNA sequences of length L = 20 when we accessed it in August 2014. 
They were categorised as follows in the fRNAdb file: 


17 % Putative conserved noncoding region (EvoFold) 

4.3 % Piwi-interacting RNA (piRNA) 

1.2 % Mature microRNA 

77 % Fly small RNA 

total number of sequences = 14350 

The vast majority of these sequences came from the fruit fly Drosophila melanogaster. The sequences labeled as 
‘putative conserved noncoding region’ are derived by bioinformatic methods and are expected to be functional ncRNA, 
but the function still needs to be confirmed. We therefore removed these sequences from our L — 20 data set. We 
also removed the piRNA for which it is not yet clear that the SS is important. We only used “Fly small RNA” for 
L = 20 from D. melanogaster^ which reduced the total number of sequences to 11, 050. Of these, about 34% map to 
the unbound trivial structure, which is close to what is expected from random G-sampling since the trivial structure 
takes up about Pt = 33% of sequences in the mapping for L = 20 (this percentage drops rapidly for increasing L). 
It remains somewhat unclear what fraction of the 7327 non-trivial structures for D. melanogaster have SS that are 
important for function. Nevertheless, we used this whole set to compare to the sampled structures. 


B. L = 55 

The contents of the fRNAdb file for L = 55, when we accessed it in August 2014, were: 

0.4 % self-splicing ribozyme RNA 

1.2 % non-protein coding (noncoding) transcript 
0.4 % HIV gag stem loop 3 (GSL3) 

0.2 % small nucleolar RNA (snoRNA) SNORD82 / SNORD83A / SNORD83B / U82 / U83A / U83B / Z25 
0.2 % Japanese encephalitis virus (JEV) hairpin structure 
0.2 % precursor micro RNA (miRNA) mir-BART2 
0.8 % guide RNA (gRNA) 

16.9 % Unagi (eel) L2 (UnaL2) LINE 3’ element 
3.7 % small nucleolar RNA (snoRNA) 

1.0 % Simian virus 40 late polyadenylation signal (SVLPA) 

0.2 % small nuclear RNA (snRNA) 

0.6 % Pyrococcus C/D box guide small nucleolar RNA (snoRNA) 

40.9 % Putative conserved noncoding region (EvoFold) 

0.2 % C/D box guide small nucleolar RNA (snoRNA) HBII-240 / SNORD72 
16.3 % Putative conserved noncoding region (RNAz) 

0.2 % C/D box guide small nucleolar RNA (snoRNA) HBII-210 / SNORD69 
0.2 % C/D box small nucleolar RNA (snoRNA) HBII-429 / SNORDIOO 

2.4 % Trans-activation response element (TAR) 

0.2 % No description given 

1.0 % small non-messenger RNA (snmRNA) 

2.2 % Hammerhead ribozyme (type HI) 

0.2 % C/D box guide small nucleolar RNA (snoRNA) Z266 
0.2 % small nucleolar RNA (snoRNA) snoR185 

0.2 % predicted precursor micro RNA (miRNA) HP-67 [false negative] 

0.2 % Rous sarcoma virus (RSV) primer binding site (PBS) 

0.4 % Ribosomal protein L13 leader 

6.5 % HIV Ribosomal frameshift signal 

0.8 % C/D box small nucleolar RNA (snoRNA) 

0.2 % putative archaeal H/ACA box small nucleolar RNA (snoRNA) u-46 

1.2 % Croup H intron 

0.2 % C/D box guide small nucleolar RNA (snoRNA) HBH-239 / SNORD71 
0.6 % Bovine leukaemia virus RNA packaging signal 
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0.2 % sok RNA 

Total number of sequences = 509 


We first note that over half (41%, 16% and 0.2%) are labelled putative. Of the remaining non-putative RNA 
described above, 17% are eel LINE elements, which are known to have a stem-loop structure that is important for 
function. The next most abundant (7%) are HIV ribsomal frameshift signal RNA, whose structure is important 
for interacting with the ribosome. Next is snoRNA (4%) which has conserved and functionally relevant secondary 
structures. The Hammerhead ribozyme (2.2%) and Group 2 intron (1.2%) are both catalytic ribozymes and so have 
SS important for their functions. Smaller abundance contributions include Bovine leukaemia virus signal, precursor 
RNA, JEV hairpin structure, TAR and RSV, all of which are known to have functionally relevant SS. In sum, for this 
dataset, a significant fraction of the non-putative RNA have known functionally relevant SS. 

In the main text Fig 2, we plot the NS size distribution for 504 structures. There were 5 structures (about 1%) 
which either had non-standard nucleotides or else the NSSE estimator did not converge. We also investigated the 
L = 55 dataset with all putative RNA discarded (leaving 213 sequences remaining). As argued above, most of the 
rest of the sequences have SS that are thought to be important for function. The distributions for this curated set of 
natural RNA structures are not notably different from the distributions which included putative ncRNA and therefore 
remain very close to the G-sampled distribution (Extended Data Figs. 1,2.) 


C. L = 70 

The contents of the fRNAdb file for A = 70 we accessed August 2014 are: 


0.04 % precursor micro RNA (miRNA) mir-99b 

0.04 % small nucleolar RNA (snoRNA) Esmeraldo SLAl 

0.91 % transfer RNA (tRNA), GCC (Gly/G) Glycine 

0.3 % non-protein coding (noncoding) transcript 

0.04 % precursor micro RNA (miRNA) mir-N367 

0.69 % transfer RNA (tRNA), GAT (Met/M) Methionine 

0.04 % precursor micro RNA (miRNA) mir-140 

0.09 % G/D box guide small nucleolar RNA (snoRNA) SNORD38 / U38 
0.04 % predicted precursor micro RNA (miRNA) RP-88 [false negative] 

0.04 % precursor micro RNA (miRNA) mir-16c 
0.04 % precursor micro RNA (miRNA) mir-24-1 
0.04 % precursor micro RNA (miRNA) mir-303 

0.04 % G/D box guide small nucleolar RNA (snoRNA) SNORD18A / U18A 
0.04 % G/D box guide small nucleolar RNA (snoRNA) Z37 
1.09 % transfer RNA (tRNA), TTG (Glu/E) Glutamic acid 
5.12 % Putative conserved noncoding region (EvoFold) 

0.04 % precursor micro RNA (miRNA) mir-K12-10b 

0.04 % precursor micro RNA (miRNA) mir-200b 

0.04 % transfer RNA (tRNA), GCG (Arg/R) Arginine 

0.04 % G/D box small nucleolar RNA (snoRNA) HBII-429 / SNORDIOO 

0.04 % precursor micro RNA (miRNA) mir-8 

0.04 % precursor micro RNA (miRNA) mir-K12-3 

0.04 % precursor micro RNA (miRNA) mir-199a-l 

0.61 % 

0.13 % small non-messenger RNA (snmRNA) 

0.17 % G/D box guide small nucleolar RNA (snoRNA) SNORD113 / SNORD114 
0.04 % precursor micro RNA (miRNA) mir-196 / mir-196-1 / mir-196a-l 
0.04 % predicted precursor micro RNA (miRNA) HP-27 [false negative] 

0.13 % G/D box guide small nucleolar RNA (snoRNA) SNORD82 / U82 / Z25 
0.04 % precursor micro RNA (miRNA) mir-412 
0.04 % Plasmid R1162 RNA 

2.78 % transfer RNA (tRNA), TCG (Arg/R) Arginine 
0.04 % precursor micro RNA (miRNA) mir-23a 
0.04 % precursor micro RNA (miRNA) mir-K12-8 

0.04 % G/D box guide small nucleolar RNA (snoRNA) SNORD63 / U63 
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0.04 % precursor micro RNA (miRNA) mir-30 / mir-30d 
0.04 % precursor micro RNA (miRNA) mir-US25-l 
0.3 % precursor micro RNA (miRNA) lin-4 

0.04 % predicted precursor micro RNA (miRNA) MP-64 [false negative] 

0.04 % C/D box guide small nucleolar RNA (snoRNA) Z40 
0.13 % transfer RNA (tRNA), GGT (Thr/T) Threonine 
0.04 % precursor micro RNA (miRNA) mir-lOb-2 
14.41 % transfer RNA (tRNA), TGG (Pro/P) Proline 

0.04 % C/D box guide small nucleolar RNA (snoRNA) SNORD2 / snR39B 
0.04 % predicted precursor micro RNA (miRNA) HP-61 [all confirmed] 

0.04 % precursor micro RNA (miRNA) mir-128a 
1.13 % transfer RNA (tRNA), TGC (Ala/A) Alanine 
0.09 % Enterovirus 5’ cloverleaf cis-acting replication element 
0.04 % S-element 

0.04 % C/D box guide small nucleolar RNA (snoRNA) R38 
0.09 % transfer RNA (tRNA), CAA (Leu/L) Leucine 
0.04 % small nucleolar RNA (snoRNA) 

1.17 % transfer RNA (tRNA), TGA (Ser/S) Serine 

0.04 % C/D box guide small nucleolar RNA (snoRNA) SNORD18B / U18B 
0.04 % C/D box small nucleolar RNA (snoRNA) 14q(I-l) / SNORD113-1 
0.04 % C/D box guide small nucleolar RNA (snoRNA) SNORD41 / U41 
0.04 % precursor micro RNA (miRNA) mir-US33 
3.69 % Putative conserved noncoding region (RNAz) 

0.13 % Selenocysteine insertion sequence 

0.04 % Tombusvirus internal replication element (IRE) 

0.04 % precursor micro RNA (miRNA) mir-lOb-1 

0.04 % C/D box guide small nucleolar RNA (snoRNA) SNORD34 / U34 
0.04 % C/D box guide small nucleolar RNA (snoRNA) SNORD51 / U51 
0.3 % HIV primer binding site (PBS) 

0.04 % suhB 

0.52 % HIV gag stem loop 3 (GSL3) 

0.04 % precursor micro RNA (miRNA) mir-219 

1.17 % transfer RNA (tRNA), GAA (Phe/F) Phenylalanine 

0.22 % C/D box small nucleolar RNA (snoRNA) 

0.09 % C/D box guide small nucleolar RNA (snoRNA) SNORD77 / U77 

0.04 % sar RNA 

3.6 % Group II Intron 

0.04 % C/D box guide small nucleolar RNA (snoRNA) Me28S-Cm788 

0.13 % transfer RNA (tRNA), TCT (Arg/R) Arginine 

0.09 % Vimentin 3’ untraslated region (UTR) protein-binding region 

1.3 % transfer RNA (tRNA), GAT (Ile/I) Isoleucine 

0.04 % transfer RNA (tRNA), AGA (Ser/S) Serine 

0.09 % precursor micro RNA (miRNA) mir-383 

0.04 % precursor micro RNA (miRNA) mir-M30 

0.13 % C/D box guide small nucleolar RNA (snoRNA) SNORD30 / U30 
0.04 % small nucleolar RNA (snoRNA) Sylvio XIO SLAl 
0.04 % sroB RNA 

0.04 % small nuclear RNA (snRNA) 

0.04 % transfer RNA (tRNA), CAC (Val/V) Valine 

0.22 % C/D box guide small nucleolar RNA (snoRNA) SNORD18 / U18 

6.77 % Selenocysteine transfer RNA (tRNA), TCA 

0.13 % precursor micro RNA (miRNA) mir-30 

0.04 % precursor micro RNA (miRNA) mir-K12-10a 

0.17 % C/D box guide small nucleolar RNA (snoRNA) U2-30 

0.09 % precursor micro RNA (miRNA) mir-BARTl 

0.04 % predicted precursor micro RNA (miRNA) RP-104 [false negative] 
0.35 % RNA-OUT 

0.04 % transfer RNA (tRNA), GGC (Ala/A) Alanine 



5 


0.04 % transfer RNA (tRNA), AGC (Ala/A) Alanine 
10.76 % transfer RNA (tRNA), GTA (Tyr/Y) Tyrosine 
0.17 % Rous sarcoma virus (RSV) primer binding site (PBS) 

0.04 % G/D box guide small nucleolar RNA (snoRNA) SNORD58 / U58 
4.9 % transfer RNA (tRNA), TAG (Val/V) Valine 
0.04 % precursor micro RNA (miRNA) mir-694 

0.09 % G/D box guide small nucleolar RNA (snoRNA) SNORD24 / U24 
0.09 % precursor micro RNA (miRNA) mir-145 

0.04 % G/D box guide small nucleolar RNA (snoRNA) SNORD56 / U56 

0.87 % Tombus virus defective interfering (DI) RNA region 3 

0.04 % precursor micro RNA (miRNA) mir-30d 

0.43 % Flavivirus DB element 

0.09 % precursor micro RNA (miRNA) mir-361 

10.11 % transfer RNA (tRNA), TTG (Gln/Q) Glutamine 

0.04 % transfer RNA (tRNA), GGG (Pro/P) Proline 

0.35 % transfer RNA (tRNA), GTT (Asn/N) Asparagine 

0.04 % transfer RNA (tRNA), GGT (Ser/S) Serine 

0.04 % precursor micro RNA (miRNA) mir-K12-4 

0.69 % transfer RNA (tRNA), GTG (Asp/D) Aspartic acid 

0.04 % transfer RNA (tRNA), AGG (Pro/P) Proline 

3.26 % transfer RNA (tRNA), TAG (Leu/L) Leucine 

0.04 % small nucleolar RNA (snoRNA) Trypanosoma brucei RNA 9 (TBR9) 

0.04 % precursor micro RNA (miRNA) mir-125b 

0.04 % transfer RNA (tRNA), AGC (Gly/G) Glycine 

0.04 % precursor micro RNA (miRNA) mir-137 

0.09 % transfer RNA (tRNA), CCC (Gly/G) Glycine 

0.04 % C/D box guide small nucleolar RNA (snoRNA) SNORD73 / U73 

0.04 % precursor micro RNA (miRNA) mir-127 

0.04 % transfer RNA (tRNA), ACA (Cys/C) Cysteine 

0.04 % precursor micro RNA (miRNA) mir-M3 

0.04 % msr RNA 

0.04 % guide RNA (gRNA) 

0.35 % transfer RNA (tRNA), TAA (Leu/L) Leucine 

0.04 % predicted precursor micro RNA (miRNA) HN-3 [false negative] 

0.04 % Glycine riboswitch 
0.04 % Y RNA 

0.13 % precursor micro RNA (miRNA) mir-183 

3.26 % transfer RNA (tRNA), TGT (Thr/T) Threonine 

0.04 % precursor micro RNA (miRNA) mir-206 

0.04 % small nucleolar RNA (snoRNA) SNORD21 / U21 

0.26 % transfer RNA (tRNA) with undetermined isotype 

0.13 % C/D box guide small nucleolar RNA (snoRNA) Z12 

0.04 % precursor micro RNA (miRNA) mir-13b-2 

0.04 % precursor micro RNA (miRNA) mir-K12-5 

0.52 % transfer RNA (tRNA), CTT (Lys/K) Lysine 

0.17 % precursor micro RNA (miRNA) mir-32 

0.04 % small nucleolar RNA (snoRNA) CL Brenner SLAl 

0.04 % precursor micro RNA (miRNA) mir-147-1 / mir-147-2 

0.04 % precursor micro RNA (miRNA) mir-790 

0.04 % precursor micro RNA (miRNA) mir-302a 

0.09 % C/D box guide small nucleolar RNA (snoRNA) SNORD50 / SNORD50A / Lf50 

0.13 % transfer RNA (tRNA), GCA (Cys/C) Cysteine 

0.04 % precursor micro RNA (miRNA) mir-369 

5.69 % transfer RNA (tRNA), GTG (His/H) Histidine 

0.09 % transfer RNA (tRNA), CCA (Trp/W) Tryptophan 

0.04 % small nucleolar RNA (snoRNA) SNORD50A / U50 

0.26 % Retroviral Psi packaging element 

0.95 % transfer RNA (tRNA), TTT (Lys/K) Lysine 



6 


0.26 % precursor micro RNA (miRNA) mir-10 
3.43 % transfer RNA (tRNA), TCC (Gly/G) Glycine 
0.04 % C/D box guide small nucleolar RNA (snoRNA) R160 
Total number of sequences = 2304 

The descriptions above show that the largest category comes from tRNA, which clearly has a functionally relevant 
structure. Additionally there are many precursor microRNA, Group 2 introns, and snoRNA, which are thought have 
functionally relevant structures. Finally, this dataset contains a relatively small proportion (^9%) of ‘putative’ RNA. 
In sum, for L — 70, the vast majority of RNA in the dataset have SS that are thought to be important for function. 
In the main text Fig 2, we plot the NS size distribution for 2263 structures. There were 41 structures (1.8%) either 
with non-standard nucleotides or where the NSSE estimator did not converge. Given that the majority of structures 
have SS thought to be important for function, we did not separately plot curated data as we did for L = 55 and 
L = 126. 


D. L = 126 

The contents of the fRNAdb file for L = 126 we accessed August 2014 are: 
1.57 % 5S ribosomal RNA 

0.31 % H/ACA box small nucleolar RNA (snoRNA) 

0.31 % predicted precursor micro RNA (miRNA) HP-58 [false negative] 

0.31 % BC200 RNA 

1.57 % Thiamin pyrophosphate (TPP) riboswitch (THI element) 

0.31 % H/ACA box guide small nucleolar RNA (snoRNA) SNORA68 / U68 
0.63 % H/ACA box guide small nucleolar RNA (snoRNA) ACA27 / SNORA27 
5.33 % 

2.82 % non-protein coding (noncoding) transcript 
0.31 % precursor micro RNA (miRNA) mir-29a-2 
0.31 % precursor micro RNA (miRNA) mir-393 

0.31 % H/ACA box guide small nucleolar RNA (snoRNA) ACA28 / SNORA28 
0.31 % Pseudomonas small noncoding RNA (sRNA) P9 
15.36 % 5.8S ribosomal RNA (rRNA) 

0.31 % precursor micro RNA (miRNA) mir-394a 
0.31 % Picornavirus internal ribosome entry site (IRES) 

0.31 % precursor micro RNA (miRNA) mir-1223 

0.31 % H/ACA box guide small nucleolar RNA (snoRNA) SNORA70 / U70 
1.25 % Ribosomal protein L20 leader 

0.31 % H/ACA box guide small nucleolar RNA (snoRNA) ACA46 / SNORA46 

0.31 % precursor micro RNA (miRNA) mir-29a-l 

0.31 % precursor micro RNA (miRNA) mir-399b / mir-399c 

0.31 % U4 spliceosomal RNA 

0.31 % precursor micro RNA (miRNA) mir-859 

0.31 % precursor micro RNA (miRNA) mir-171e 

0.31 % HgcE RNA 

7.52 % small nuclear RNA (snRNA) 

0.31 % Enteroviral 3’ untraslated region (UTR) element 
8.15 % Putative conserved noncoding region (EvoEold) 

0.31 % Threonine operon leader 

0.31 % precursor micro RNA (miRNA) mir-860 

0.31 % HgcC family RNA 

33.86 % Putative conserved noncoding region (RNAz) 

0.63 % ydaO / yuaA leader 

0.31 % precursor micro RNA (miRNA) mir-169p 
0.31 % precursor micro RNA (miRNA) mir-396b 
0.31 % precursor micro RNA (miRNA) mir-169g 
0.31 % precursor micro RNA (miRNA) mir-866 
0.63 % Hepatitis C alternative reading frame stem-loop 
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0.31 % precursor micro RNA (miRNA) mir-169c 
0.63 % small non-messenger RNA (snmRNA) 

0.31 % FMN riboswitch (RFN element) 

0.31 % H/ACA box guide small nucleolar RNA (snoRNA) ACA42 / SNORA42 
0.31 % ylbH leader 

0.94 % H/ACA box guide small nucleolar RNA (snoRNA) ACA25 / SNORA25 
0.63 % Ribosomal protein LIO leader 

0.94 % C/D box guide small nucleolar RNA (snoRNA) SNORD14 / U14 
0.31 % H/ACA box guide small nucleolar RNA (snoRNA) ACA44 / SNORA44 
0.31 % precursor micro RNA (miRNA) mir-164 
0.31 % repair RNA 

0.31 % H/ACA box guide small nucleolar RNA (snoRNA) ACA58 / SNORA58 
0.31 % precursor micro RNA (miRNA) mir-190 
0.31 % GcvB RNA 

0.31 % precursor micro RNA (miRNA) mir-172b 

0.31 % C/D box guide small nucleolar RNA (snoRNA) SNORD118 / U8 
0.31 % precursor micro RNA (miRNA) mir-166g 

0.31 % H/ACA box small nucleolar RNA (snoRNA) ACA63 / SNORA77 

0.31 % precursor micro RNA (miRNA) mir-156e 

0.31 % precursor micro RNA (miRNA) mir-399e 

0.63 % C/D box guide small nucleolar RNA (snoRNA) R9 

0.63 % Group H intron 

0.63 % yybP-ykoY leader 

0.31 % H/ACA box guide small nucleolar RNA (snoRNA) ACA61 / SNORA61 
0.63 % C/D box guide small nucleolar RNA (snoRNA) SNORD22 / U22 
0.94 % H/ACA box guide small nucleolar RNA (snoRNA) E3 / SNORA63 
0.31 % H/ACA box guide small nucleolar RNA (snoRNA) ACA40 / SNORA40 
Total number of sequences = 319 


In the main text, Fig, 2, we plotted 291 of the 319 structures. For 28 structures (9.6%) either there were non¬ 
standard nucleotides or the NSSE estimator did not converge. We note that the NSSE has more difficulty converging 
for longer L, which is one reason this length was the longest we study here. About 40% of the structures are labelled 
putative, that is they are expected to be functional ncRNA, but the function still needs to be confirmed. When these 
were left out of the data, we ended up with 172 sequences for which the NS size of the structures could be found. Erom 
their descriptions above, most have SS that are thought to be important for function. As can be seen in Extended 
Data Eigs 1,2, The NS size distribution for this curated data set is very close to the G-sampled estimate and the the 
full data set. 

One concern could be that some sequences in the database are from closely related organisms, which could bias the 
data. To double check, we took one sequence for each of the 66 categories above from the L = 126 fRNAdb file. Of 
these, 1 sequence contained non-standard nucleotides, and 4 failed in the NNSE. This leaves 61 sequences. The mean 
S = log!/ for these 61 sequences is S' = 51.92, and the standard deviation is 1.76. This is very close to the results 
obtained when using all 291 sequences that we could fold where we find that the mean is S = 51.94, and the standard 
deviation is 1.94. Both these results are close to our estimates from 1000 randomly sampled L = 126 sequences where 
we found Sg = 51.5 with standard deviation ac = 2.1. 


III. SUPPLEMENTARY FIGURES 



j - analytical fit to Pp(f2) L=20"_ 
analytic fit to PQ(^^) I 

• P-sampled Pp(f2) . 

• G-sampled PqC^) , 

.0.5-" natural data ^ - 



Neutral set size (H) 















Figure SI. Neutral set size distributions for RNA secondary structures. Data as in the main text, Fig. 3, but with 
a larger set of lengths. Randomly sampled sequences are used to estimate Pp(fl) (red diamonds) and PG{fl) (blue circles). 
Black and cyan lines are theoretical approximations to Pp{Q) and Pg{0,) respectively. Green squares denote the probability 
distribution for natural ncRNA secondary structures taken from the fRNAdb m database (green squares), which are close 
to the G-sampled Pg{^), but quite far from the P-sampled Pp{Q). The number of natural structures plotted are: 7327 for 
L = 20; 658 for L = 40; 533 for L = 45; 472 for L = 50; 504 for L = 55; 350 for L = 60; 529 for L = 65; 2263 for 
L = 70; 1385 for L = 75; 553 for L = 80; 893 for L = 85; 731 for L = 90; 558 for L = 95; 891 for L = 100; and 291 
for L = 126. In each case all structures in the fRNAdb |21| database are used, except for L = 20 where only structures for 
Drosophila melanogaster are used, and for L = 55 and L — 126 where we also plot a curated data set of 213 and 172 structures 
respectively (magenta triangles) where the SS is known to be important (see subsection 0. Smaller numbers of data lead to 
larger binning error bars, but the curated sets are clearly very similar to the full set of fRNAdb structures. 
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Figure S2. Neutral set size distributions for RNA secondary structures Same data as Supplementary Fig SI, but 
on a log-log scale. 
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Figure S3. Comparison of Vienna package and RNAstructure package predictions. In each case green symbols 
are for natural sequences, with their SS and number of stacks calculated with the respective method and the black symbols 
are for G-sampled sequences, (a) Stacks for L = 100 (Vienna) (b) Stacks for L = 100 (RNAstructure) (c) Bonds for L = 100 
(Vienna) (d) Bonds for L = 100 RNAstructure. Note the overall good agreement between the two methods. While the natural 
and G-sampled stack distributions are very close, the natural RNA have slightly more bonds (about 2.5 more) on average 
than the G-sampled distributions do (28.9 v.s. 26.3 respectively for the Vienna package and 29.3 v.s. 26.8 respectively for 
RNAstruture), in agreement with the expectation that they are slightly more stable than G-sampled structures |14| . Similar 
agreement between the two methods is found for other motifs such as junctions, bulges and loops, where again natural RNA 
distributions are very close to G-sampled distributions. 
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Figure S4. PG(f2) with 30% GC (AU bias) or 70% GC (GC bias). The G-sampled distributions were taken from 
5000 samples for each bias and length. These are compared to the unbiased results obtained from 3 x 10® samples for L = 40 
and 2 x 10'* samples for T = 60 (blue circles) and our approximation for Pa{0,) (cyan line). Strong AU bias (brown diamonds) 
favours structures with slightly larger NS size, while strong GG bias (red squares) favours structures with slightly smaller NS 
size. Although the overall effect is rather small, it would be interesting to compare RNA structures for genomes that contain 
a large bias to see if such shifts can be observed. 
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Figure S5. Measured or extracted properties compared to linear fits as a function of RNA strand length L. Left 
figure: Green diamonds are the number of SS, Np, from full enumerations (up to L = 20). Red circles are Np extracted from 
sampling and fitting to a binomial form for longer L (Methods). Np = 0.13 x 1.76^ (straight black line) provides a good fit 
{R^ = 0.9986) to the data. Right figure: The logarithm of the largest non-trivial NS size, log(n) = U, (blue diamonds) can be 
fit with U = (0.514 ± 0.009)L — (0.20 ± 0.5) (blue line). The logarithm of the NS size of the trivial structure, log(n) = T, (red 
squares) can be fit with T = (0.5166 ± 0.0009)1/ + (1.33 ± 0.06) (red line), and is close to U. The mean So of the G-sampled 
distribution of S' = log(f2) (green triangles) can be fit with So = (0.399 ± 0.0014)L + (1.48 ±0.09) (green line). Measured data 
for the binomial fit parameter N (black circles) can be fit with N = 1.681/ — 43 (black line) for N > 40. For lower N a linear 
fit does not capture the data as well. For these shorter length there tends to be more structure in the measured G-distribution 
than in the smoother binomial fit. 




Number of stacks K 


Figure S6. The number of stacks (continuous sets of base-pairs) K . (a) For L = 55, the relationship between NS size 
and number of stacks (green squares), calculated by G-sampling, can be fit with log(f2) = 28.053 — 1.22K {B? = —0.84) (black 
line), (b) For L — 100 we find log(fl) = 49.238 — 1.114R' {R^ = 0.80). Similar linear correlations are found for other lengths. 








